Explicit Planning for Efficient Exploration in Reinforcement Learning
Zhang, Liangpeng, Tang, Ke, Yao, Xin
Efficient exploration is crucial to achieving good performance in reinforcement learning. Existing systematic exploration strategies (R-MAX, MBIE, UCRL, etc.), despite their theoretical promise, are essentially greedy strategies that follow predefined heuristics. When the heuristics do not match the dynamics of Markov decision processes (MDPs) well, an excessive amount of time can be wasted travelling through already-explored states, lowering the overall efficiency. We argue that explicit planning for exploration can alleviate this problem, and propose the Value Iteration for Exploration Cost (VIEC) algorithm, which computes the optimal exploration scheme by solving an augmented MDP. We then present a detailed analysis of the exploration behaviour of several popular strategies, showing how they can fail and spend O(n^2 md) or O(n^2 m + nmd) steps to collect sufficient data in some tower-shaped MDPs, while the optimal exploration scheme, obtainable by VIEC, needs only O(nmd), where n and m are the numbers of states and actions and d is the data demand. The analysis not only points out the weakness of existing heuristic-based strategies, but also suggests remarkable potential in explicit planning for exploration.
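To make the augmented-MDP idea concrete, here is a minimal sketch of value iteration for exploration cost, assuming a known, deterministic model; the `step`/`consume` helpers and the dictionary-based demand representation are our illustrative choices, not the paper's implementation (which handles general stochastic MDPs).

```python
import itertools

def viec_sketch(states, actions, step, demand):
    """Minimal sketch of Value Iteration for Exploration Cost (VIEC).

    Assumes a known, deterministic model for simplicity. step(s, a) -> s'
    is the transition function (it must map states in `states` back into
    `states`), and `demand` maps (state, action) pairs to the minimum
    number of times each pair must still be tried. Returns V, where
    V[(s, d)] is the minimal number of steps needed to satisfy the
    remaining demand vector d when starting from state s.
    """
    pairs = sorted(demand)  # fix an order for the demand vector
    ranges = [range(demand[p] + 1) for p in pairs]
    aug = [(s, d) for s in states for d in itertools.product(*ranges)]

    def consume(d, s, a):
        # taking (s, a) satisfies one unit of its remaining demand, if any
        d = list(d)
        if (s, a) in demand:
            i = pairs.index((s, a))
            d[i] = max(0, d[i] - 1)
        return tuple(d)

    INF = float("inf")
    # zero cost once every demand is met; unknown (infinite) otherwise
    V = {(s, d): (0 if not any(d) else INF) for s, d in aug}

    # value iteration on the augmented MDP; since all steps cost 1, an
    # optimal scheme never revisits an augmented state, so |aug| sweeps
    # are enough for the Bellman updates to converge
    for _ in range(len(aug)):
        for s, d in aug:
            if not any(d):
                continue
            V[(s, d)] = min(
                1 + V[(step(s, a), consume(d, s, a))] for a in actions
            )
    return V
```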
Author Feedback (df3aebc649f9e3b674eeb790a4da224e-AuthorFeedback.pdf)
Table 1: Robustness to model mismatch. Top-1 accuracy of SIPS at the third time quartile (Q3), evaluated on data generated by humans, RL agents, and mismatched models. We ran SIPS assuming r = 2, q = 0.95, T = 10, and a Manhattan heuristic (h); matched parameters are starred (*).
We thank the reviewers for engaging carefully with our paper, and for providing helpful and constructive feedback. We will expand on these experiments in the final paper with more domains and cross-method comparisons.
Reviews: Explicit Planning for Efficient Exploration in Reinforcement Learning
This paper introduces the interesting idea of demand matrices for doing pure exploration more efficiently. A demand matrix simply specifies the minimum number of times each state-action pair must be visited. The remaining demand is then treated as an additional part of the state in an augmented MDP, which can be solved to derive the optimal exploration strategy for satisfying the specified initial demand. While the idea is interesting and solid, there are downsides to the approach itself, and some of the analysis in this paper could be improved. In particular, there are no theoretical guarantees that using this algorithm together with a simultaneously learned model will work.
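To illustrate the demand-matrix mechanism the review describes, here is a toy run of the VIEC sketch above (our own two-state example, not one from the paper or the review):

```python
# toy usage of viec_sketch: a two-state chain where action 'a' toggles the
# state and 'b' stays put; the demand matrix asks for (state 0, action 'a')
# to be tried twice before exploration is considered done
V = viec_sketch(
    states=[0, 1],
    actions=['a', 'b'],
    step=lambda s, a: 1 - s if a == 'a' else s,
    demand={(0, 'a'): 2},
)
print(V[(0, (2,))])  # 3: take 'a', toggle back with 'a' from state 1, take 'a' again
```

Each transition decrements the matching entry of the demand vector, and the value of an augmented state (s, d) is exactly the number of steps the optimal exploration scheme still needs from there.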
Explicit Planning Helps Language Models in Logical Reasoning
Zhao, Hongyu, Wang, Kangrui, Yu, Mo, Mei, Hongyuan
Language models have been shown to perform remarkably well on a wide range of natural language processing tasks. In this paper, we propose LEAP, a novel system that uses language models to perform multi-step logical reasoning and incorporates explicit planning into the inference procedure. Explicit planning enables the system to make more informed reasoning decisions at each step by looking ahead into their future effects. Moreover, we propose a training strategy that safeguards the planning process from being led astray by spurious features. Our full system significantly outperforms other competing methods on multiple standard datasets. When using small T5 models as its core selection and deduction components, our system performs competitively compared to GPT-3 despite having only about 1B parameters (i.e., 175 times smaller than GPT-3). When using GPT-3.5, it significantly outperforms chain-of-thought prompting on the challenging PrOntoQA dataset. We have conducted extensive empirical studies to demonstrate that explicit planning plays a crucial role in the system's performance.
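As a rough illustration of what "looking ahead into their future effects" means here, below is a minimal lookahead-selection sketch; `deduce`, `score`, and the toy usage are our placeholders, standing in for LEAP's learned selection and deduction modules, not the paper's code.

```python
def plan_step(state, candidates, deduce, score, depth=2):
    """Rough sketch of lookahead-based step selection for multi-step reasoning.

    `state` is the current set of derived facts, `candidates` the possible
    next deduction steps, deduce(state, step) applies a step, and
    score(state) estimates how promising a state is w.r.t. the goal.
    """
    def lookahead(s, d):
        if d == 0:
            return score(s)
        # score a state by the best continuation reachable within d more steps
        return max((lookahead(deduce(s, c), d - 1) for c in candidates),
                   default=score(s))

    # planning: choose the immediate step with the best d-step future,
    # instead of greedily trusting the one-step score
    return max(candidates, key=lambda c: lookahead(deduce(state, c), depth - 1))

# toy usage: "facts" are numbers, steps add a value, goal is to reach 10
best = plan_step(0, [1, 3, 5],
                 deduce=lambda s, c: s + c,
                 score=lambda s: -abs(10 - s))
print(best)  # 5: its two-step continuations can hit the goal exactly (5 + 5)
```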
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > New York (0.04)
- Workflow (0.66)
- Research Report (0.64)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.75)